{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# FAISS for TEXT - Quick Start\n",
        "This notebook is a companion of chapter 2 of the \"Domain Specific LLms in Action\" book, author Guglielmo Iozzia, [Manning Publications](https://www.manning.com/), 2024.  \n",
        "The code in this notebook is to introduce readers to the [FAISS](https://faiss.ai/index.html) library. No hardware acceleration required to execute all the code cells.  "
      ],
      "metadata": {
        "id": "8adSWyluubZQ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Install the missing required packages in the Colab VM. Only FAISS for CPU, and [SentenceTransformers](https://www.sbert.net/) not available by default."
      ],
      "metadata": {
        "id": "asJhDfESvlBT"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "wm0PAXWduQ3V"
      },
      "outputs": [],
      "source": [
        "!pip install faiss-cpu sentence-transformers"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Import the necessary packages/classes."
      ],
      "metadata": {
        "id": "hnjVXGZfwBwD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "\"\"\"Module to cluster embeddings and create indices.\"\"\"\n",
        "import faiss\n",
        "\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "from sentence_transformers import SentenceTransformer"
      ],
      "metadata": {
        "id": "BYUItwlXwOcA"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Set the data corpus for this example and put it into a Pandas DataFrame."
      ],
      "metadata": {
        "id": "YT8a4VjOwVK_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "data = [['His secret identity is Peter Parker', 'spiderman'],\n",
        "        ['A businessman and engineer who ' +\n",
        "         'runs the company Stark Industries',\n",
        "         'ironman'],\n",
        "        ['Superhuman spider-powers and abilities ' +\n",
        "         'after being bitten by a radioactive spider',\n",
        "         'spiderman'],\n",
        "        ['A frail man enhanced to the peak of human ' +\n",
        "         'physical perfection by an experimental super-soldier serum', 'captainamerica']\n",
        "        ]\n",
        "df = pd.DataFrame(data, columns = ['text', 'context'])"
      ],
      "metadata": {
        "id": "RKFC6tRswcDB"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Get embeddings from the data corpus, generate a FAISS index and add the embeddings to it (after normalization)."
      ],
      "metadata": {
        "id": "sL0xASEt0ZbD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "text = df['text']\n",
        "encoder = SentenceTransformer(\"paraphrase-mpnet-base-v2\")\n",
        "vectors = encoder.encode(text)\n",
        "vector_dimension = vectors.shape[1]\n",
        "l2_index = faiss.IndexFlatL2(vector_dimension)\n",
        "faiss.normalize_L2(vectors)\n",
        "l2_index.add(vectors)"
      ],
      "metadata": {
        "id": "jbaTFSuy0sCk"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Prepare a search text to be used for similarity search with FAISS on the generated index."
      ],
      "metadata": {
        "id": "_bJ7Nzco023F"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "search_text = 'He throws webs'\n",
        "search_vector = encoder.encode(search_text)\n",
        "search_vector_as_array = np.array([search_vector])\n",
        "faiss.normalize_L2(search_vector_as_array)"
      ],
      "metadata": {
        "id": "TDRt6xSO2F5G"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Perform a search within the created index (calculation of the distances between the search text and the strings within the index)."
      ],
      "metadata": {
        "id": "jjX37kK42RGG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "k = l2_index.ntotal\n",
        "distances, ann = l2_index.search(search_vector_as_array, k=k)"
      ],
      "metadata": {
        "id": "XCC9wBQU2a0y"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Prepare the results to be displayed in a user-friendly format."
      ],
      "metadata": {
        "id": "jF6Ztv7T2g4t"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "search_results = pd.DataFrame({'distances': distances[0], 'ann': ann[0]})\n",
        "merged_df = pd.merge(search_results, df, left_on='ann', right_index=True)\n",
        "merged_df.head()"
      ],
      "metadata": {
        "id": "9oPKdHSE21yU"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}